34 research outputs found
Approximately Minwise Independence with Twisted Tabulation
A random hash function h is ε-minwise if for any set S with |S| = n
and any element x ∈ S, Pr[h(x) = min h(S)] = (1 ± ε)/n.
Minwise hash functions with low bias have widespread applications
within similarity estimation.
Hashing from a universe , the twisted tabulation hashing of
Pătrașcu and Thorup [SODA'13] makes lookups in tables of size
. Twisted tabulation was invented to get good concentration for
hashing based sampling. Here we show that twisted tabulation yields -minwise hashing.
In the classic independence paradigm of Wegman and Carter [FOCS'79] -minwise hashing requires -independence [Indyk
SODA'99]. Pătrașcu and Thorup [STOC'11] had shown that simple
tabulation, using the same space and lookups, yields -minwise
independence, which is good for large sets, but useless for small sets. Our
analysis uses some of the same methods, but is much cleaner, bypassing a
complicated induction argument.
Comment: To appear in Proceedings of SWAT 201
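To illustrate how a minwise hash function is used for similarity estimation, here is a minimal Python sketch. It assumes 32-bit integer keys split into four 8-bit characters, and it follows one common description of twisted tabulation (the tail characters produce a "twister" that is XORed into the head character before the final lookup). The class name `TwistedTabulation`, the helper names, and all parameters are illustrative assumptions, not taken from the paper.

```python
import random

CHAR_BITS = 8                    # assumed character width
NUM_CHARS = 4                    # assumed: 32-bit keys, 4 characters
MASK = (1 << CHAR_BITS) - 1


class TwistedTabulation:
    """Illustrative twisted tabulation hash for 32-bit integer keys."""

    def __init__(self, rng, out_bits=32):
        size = 1 << CHAR_BITS
        # Table for the head character (character 0).
        self.head = [rng.getrandbits(out_bits) for _ in range(size)]
        # One table per tail character; each entry is (twister, hash part).
        self.tail = [
            [(rng.getrandbits(CHAR_BITS), rng.getrandbits(out_bits))
             for _ in range(size)]
            for _ in range(NUM_CHARS - 1)
        ]

    def __call__(self, key):
        twist, h = 0, 0
        for i, table in enumerate(self.tail, start=1):
            t, v = table[(key >> (CHAR_BITS * i)) & MASK]
            twist ^= t
            h ^= v
        # The head character is "twisted" before its table lookup.
        return h ^ self.head[(key & MASK) ^ twist]


def minhash_signature(items, hashes):
    """For each hash function, keep the smallest hash value over the set."""
    return [min(h(x) for x in items) for h in hashes]


def estimate_jaccard(a, b, hashes):
    """Fraction of hash functions whose minimum agrees on both sets."""
    sig_a = minhash_signature(a, hashes)
    sig_b = minhash_signature(b, hashes)
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(hashes)
```

With a low-bias minwise family, the fraction of agreeing minima approximates the Jaccard similarity |A ∩ B| / |A ∪ B|; using more hash functions reduces the variance of the estimate.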
One Table to Count Them All: Parallel Frequency Estimation on Single-Board Computers
Sketches are probabilistic data structures that can provide approximate
results within mathematically proven error bounds while using orders of
magnitude less memory than traditional approaches. They are tailored for
streaming data analysis even on architectures with limited memory, such as
single-board computers that are widely exploited for IoT and edge computing.
Since these devices offer multiple cores, with efficient parallel sketching
schemes, they are able to manage high volumes of data streams. However, since
their caches are relatively small, a careful parallelization is required. In
this work, we focus on the frequency estimation problem and evaluate the
performance of a high-end server, a 4-core Raspberry Pi and an 8-core Odroid.
As a sketch, we employed the widely used Count-Min Sketch. To hash the stream
in parallel and in a cache-friendly way, we applied a novel tabulation approach
and rearranged the auxiliary tables into a single one. To parallelize the
process efficiently, we modified the workflow and applied a form of
buffering between hash computations and sketch updates. Today, many
single-board computers have heterogeneous processors that combine slow and
fast cores. To utilize all these cores to their full
potential, we proposed a dynamic load-balancing mechanism which significantly
increased the performance of frequency estimation.
Comment: 12 pages, 4 figures, 3 algorithms, 1 table, submitted to EuroPar'1
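For readers unfamiliar with the Count-Min Sketch mentioned above, the following is a minimal sequential Python sketch. It assumes integer keys and uses simple multiply-add hashing modulo a Mersenne prime for each row; it does not reproduce the tabulation hashing, the merged single table, the buffering, or the load balancing described in the abstract. The class name and parameters are illustrative.

```python
import random


class CountMinSketch:
    """Basic Count-Min Sketch for non-negative integer keys (illustrative)."""

    PRIME = (1 << 61) - 1  # Mersenne prime used by the assumed row hashes

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row for h(x) = ((a*x + b) mod p) mod width.
        self.coeffs = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(depth)
        ]

    def _index(self, row, key):
        a, b = self.coeffs[row]
        return ((a * key + b) % self.PRIME) % self.width

    def add(self, key, count=1):
        """Increment one counter per row."""
        for r, row in enumerate(self.rows):
            row[self._index(r, key)] += count

    def estimate(self, key):
        """The minimum over rows never underestimates the true frequency."""
        return min(row[self._index(r, key)] for r, row in enumerate(self.rows))
```

Because every update only adds to counters, the estimate for a key is always at least its true frequency; collisions can only inflate it, and taking the minimum over rows limits that inflation.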
Quicksort, Largest Bucket, and Min-Wise Hashing with Limited Independence
Randomized algorithms and data structures are often analyzed under the
assumption of access to a perfect source of randomness. The most fundamental
metric used to measure how "random" a hash function or a random number
generator is, is its independence: a sequence of random variables is said to be
k-independent if every variable is uniform and every size-k subset is
independent. In this paper we consider three classic algorithms under limited
independence. We provide new bounds for randomized quicksort, min-wise hashing
and largest bucket size under limited independence. Our results can be
summarized as follows.
- Randomized quicksort. When pivot elements are computed using a
-independent hash function, Karloff and Raghavan, J.ACM'93 showed expected worst-case running time for a special version of quicksort.
We improve upon this, showing that the same running time is achieved with only
-independence.
- Min-wise hashing. For a set , consider the probability of a particular
element being mapped to the smallest hash value. It is known that
-independence implies the optimal probability . Broder et al.,
STOC'98 showed that -independence implies it is . We show
a matching lower bound as well as new tight bounds for - and -independent
hash functions.
- Largest bucket. We consider the case where balls are distributed to
buckets using a -independent hash function and analyze the largest bucket
size. Alon et al., STOC'97 showed that there exists a -independent hash
function implying a bucket of size . We generalize the
bound, providing a -independent family of functions that imply size .
Comment: Submitted to ICALP 201
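As a concrete illustration of the largest-bucket setting, the Python sketch below throws n balls into n buckets using a random polynomial of degree k-1 over a prime field, which is a standard k-independent family. The final reduction modulo the number of buckets makes the distribution only approximately uniform; that shortcut, the prime, and the function names are assumptions made for this example and are not taken from the paper.

```python
import random
from collections import Counter

PRIME = (1 << 61) - 1  # field size; keys are assumed to be smaller than this


def make_k_independent_hash(k, num_buckets, rng):
    """Random degree-(k-1) polynomial over Z_p, reduced to a bucket index."""
    coeffs = [rng.randrange(PRIME) for _ in range(k)]

    def h(x):
        acc = 0
        for c in coeffs:  # Horner evaluation of the polynomial at x
            acc = (acc * x + c) % PRIME
        return acc % num_buckets

    return h


def largest_bucket(n, k, seed=0):
    """Distribute balls 0..n-1 into n buckets; return (max load, all loads)."""
    rng = random.Random(seed)
    h = make_k_independent_hash(k, n, rng)
    loads = Counter(h(ball) for ball in range(n))
    return max(loads.values()), loads
```

Running `largest_bucket` for several values of k and several seeds gives an empirical feel for how the maximum load behaves as the independence grows, although a single random draw says nothing about the worst-case families the abstract refers to.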
Picture-Hanging Puzzles
We show how to hang a picture by wrapping rope around n nails, making a
polynomial number of twists, such that the picture falls whenever any k out of
the n nails get removed, and the picture remains hanging when fewer than k
nails get removed. This construction makes for some fun mathematical magic
performances. More generally, we show that the Boolean functions
describing when the picture falls, in terms of which nails get removed, are
exactly the monotone Boolean functions. This construction requires an exponential
number of twists in the worst case, but exponential complexity is almost always
necessary for general functions.
Comment: 18 pages, 8 figures, 11 puzzles. Journal version of FUN 2012 paper
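The simplest case of such a hanging, where removing any single nail makes the picture fall, can be modelled with words in a free group: wrapping the rope clockwise around nail i is the generator i, counterclockwise is its inverse, and removing a nail deletes its generator. The Python sketch below builds a hanging from nested commutators and checks it. The integer encoding and the function names are assumptions for illustration, and the sketch covers only this 1-out-of-n case, not the general k-out-of-n construction.

```python
def inverse(word):
    """Inverse of a free-group word (letters are nonzero ints, -g inverts g)."""
    return [-g for g in reversed(word)]


def free_reduce(word):
    """Cancel adjacent inverse pairs until none remain."""
    stack = []
    for g in word:
        if stack and stack[-1] == -g:
            stack.pop()
        else:
            stack.append(g)
    return stack


def hang(lo, hi):
    """Commutator-based hanging on nails lo..hi (1-out-of-n case)."""
    if lo == hi:
        return [lo]
    mid = (lo + hi) // 2
    a, b = hang(lo, mid), hang(mid + 1, hi)
    return a + b + inverse(a) + inverse(b)


def remove_nail(word, nail):
    """Removing a nail erases its generator; then reduce the word."""
    return free_reduce([g for g in word if abs(g) != nail])


def falls(word):
    """The picture falls exactly when the word reduces to the identity."""
    return len(free_reduce(word)) == 0
```

Splitting the nails into two balanced halves keeps the word length polynomial in n: each level of recursion doubles the length of both halves, so `hang(1, 4)` has 16 letters rather than the exponential length a one-nail-at-a-time recursion would produce.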
Triangle Counting in Dynamic Graph Streams
Estimating the number of triangles in graph streams using a limited amount of
memory has become a popular topic in the last decade. Different variations of
the problem have been studied, depending on whether the graph edges are
provided in an arbitrary order or as incidence lists. However, with a few
exceptions, the algorithms have considered insert-only streams. We
present a new algorithm estimating the number of triangles in dynamic
graph streams where edges can be both inserted and deleted. We show that our
algorithm achieves better time and space complexity than previous solutions for
various graph classes, for example sparse graphs with a relatively small number
of triangles. Also, for graphs with constant transitivity coefficient, a common
situation in real graphs, this is the first algorithm achieving constant
processing time per edge. The result is achieved by a novel approach combining
sampling of vertex triples and sparsification of the input graph. In the course
of the analysis of the algorithm we present a lower bound on the number of
pairwise independent 2-paths in general graphs which might be of independent
interest. At the end of the paper we discuss lower bounds on the space
complexity of triangle counting algorithms that make no assumptions on the
structure of the graph.
Comment: New version of a SWAT 2014 paper with improved result
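To make the idea of sampling vertex triples in a dynamic stream concrete, here is a deliberately naive Python estimator. It samples triples up front, tracks how many of each triple's three edges are currently present as edges are inserted and deleted, and scales the fraction of complete triples by the total number of triples. It omits the sparsification step and does not match the complexity bounds of the algorithm in the abstract; the class name and interface are assumptions, and the stream is assumed to describe a simple graph (no duplicate insertions).

```python
import random
from itertools import combinations
from math import comb


class TripleSampler:
    """Naive triangle estimator for streams of edge insertions/deletions."""

    def __init__(self, n, samples, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.triples = [
            tuple(sorted(rng.sample(range(n), 3))) for _ in range(samples)
        ]
        self.edge_count = [0] * samples
        # Map each possible edge of a sampled triple to the triples using it.
        self.by_edge = {}
        for idx, triple in enumerate(self.triples):
            for edge in combinations(triple, 2):
                self.by_edge.setdefault(edge, []).append(idx)

    def update(self, u, v, delta):
        """delta = +1 for an insertion, -1 for a deletion of edge {u, v}."""
        edge = (min(u, v), max(u, v))
        for idx in self.by_edge.get(edge, ()):
            self.edge_count[idx] += delta

    def estimate(self):
        """Scale the fraction of fully connected sampled triples."""
        hits = sum(1 for c in self.edge_count if c == 3)
        return hits / len(self.triples) * comb(self.n, 3)
```

Sampling uniformly over all triples is only practical when triangles are dense among triples; for sparse graphs with few triangles almost every sample misses, which is one reason more refined sampling schemes are needed.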
Dynamic Compressed Strings with Random Access
We consider the problem of storing a string S in dynamic compressed form, while permitting operations directly on the compressed representation of S: access a substring of S; replace, insert or delete a symbol in S; count how many occurrences of a given symbol appear in any given prefix of S (the rank operation); and locate the position of the ith occurrence of a symbol inside S (the select operation). We discuss the time complexity of several combinations of these operations along with the entropy space bounds of the corresponding compressed indexes. In this way, we extend or improve the bounds of previous work by Ferragina and Venturini [TCS, 2007], Jansson et al. [ICALP, 2012], and Nekrich and Navarro [SODA, 2013].
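The operations listed above can be pinned down with an uncompressed reference implementation. The Python sketch below stores the string as a plain list and answers every operation by direct scanning, so it has none of the space or time guarantees the abstract is about; it only fixes the semantics. The class name, 0-based positions, and 1-based occurrence index in `select` are conventions assumed for this example.

```python
class NaiveDynamicString:
    """Uncompressed baseline showing the semantics of each operation."""

    def __init__(self, text=""):
        self.symbols = list(text)

    def access(self, start, length):
        """Return the substring of the given length starting at start."""
        return "".join(self.symbols[start:start + length])

    def insert(self, pos, symbol):
        self.symbols.insert(pos, symbol)

    def delete(self, pos):
        del self.symbols[pos]

    def replace(self, pos, symbol):
        self.symbols[pos] = symbol

    def rank(self, symbol, prefix_len):
        """Occurrences of symbol among the first prefix_len symbols."""
        return self.symbols[:prefix_len].count(symbol)

    def select(self, symbol, i):
        """0-based position of the i-th (1-based) occurrence of symbol."""
        seen = 0
        for pos, s in enumerate(self.symbols):
            if s == symbol:
                seen += 1
                if seen == i:
                    return pos
        raise ValueError("fewer than i occurrences of symbol")
```

For example, on the string "abracadabra", `rank("a", 4)` is 2 because the prefix "abra" contains two occurrences, and `select("a", 3)` is 5 because the occurrences of "a" sit at positions 0, 3, 5, 7 and 10.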
Yes, There is an Oblivious RAM Lower Bound!
An Oblivious RAM (ORAM), introduced by Goldreich and Ostrovsky [JACM'96], is a (possibly randomized) RAM for which the memory access pattern reveals no information about the operations performed. The main performance metric of an ORAM is the bandwidth overhead, i.e., the multiplicative factor of extra memory blocks that must be accessed to hide the operation sequence. In their seminal paper introducing the ORAM, Goldreich and Ostrovsky proved an amortized bandwidth overhead lower bound for ORAMs with memory size . Their lower bound is very strong in the sense that it applies to the "offline" setting in which the ORAM knows the entire sequence of operations ahead of time.
However, as pointed out by Boyle and Naor [ITCS'16] in the paper "Is there an oblivious RAM lower bound?", there are two caveats with the lower bound of Goldreich and Ostrovsky: (1) it only applies to "balls in bins" algorithms, i.e., algorithms where the ORAM may only shuffle blocks around and not apply any sophisticated encoding of the data, and (2) it only applies to statistically secure constructions. Boyle and Naor showed that removing the "balls in bins" assumption would result in superlinear lower bounds for sorting circuits, a long-standing open problem in circuit complexity. As a way to circumvent this barrier, they also proposed a notion of an "online" ORAM, which is an ORAM that remains secure even if the operations arrive in an online manner. They argued that most known ORAM constructions work in the online setting as well.
Our contribution is an lower bound on the bandwidth overhead of any online ORAM, even if we require only computational security and allow arbitrary representations of data, thus greatly strengthening the lower bound of Goldreich and Ostrovsky in the online setting. Our lower bound applies to ORAMs with memory size and any word size . The bound therefore asymptotically matches the known upper bounds when
Lower Bounds for Multi-Server Oblivious RAMs
In this work, we consider the construction of oblivious RAMs (ORAM) in a setting
with multiple servers, where the adversary may corrupt a subset of the servers.
We present an overhead lower bound for any -server
ORAM that limits any PPT adversary to distinguishing advantage at most when
only one server is corrupted. In other words, if one insists on
negligible distinguishing advantage, then multi-server ORAMs cannot
be faster than single-server ORAMs even with polynomially many servers
of which only one unknown server is corrupted.
Our results apply to ORAMs that may err with probability at most
as well as scenarios where the adversary corrupts larger subsets of servers.
We also extend our lower bounds to other important data structures
including oblivious stacks, queues, deques, priority queues and search trees.
Lower Bounds for Encrypted Multi-Maps and Searchable Encryption in the Leakage Cell Probe Model
Encrypted multi-maps (EMMs) enable clients to outsource the storage of
a multi-map to a potentially untrusted server while maintaining the ability
to perform operations in a privacy-preserving manner. EMMs are an important
primitive as they are an integral building block for many practical applications
such as searchable encryption and encrypted databases.
In this work, we formally examine the tradeoffs between privacy and
efficiency for EMMs.
Currently, all known dynamic EMMs with constant overhead reveal whether two
operations are performed on the same key or not, which we denote as the
global key-equality pattern.
In our main result, we present strong evidence that the leakage of the
global key-equality pattern is inherent for
any dynamic EMM construction with efficiency.
In particular, we consider the slightly smaller leakage of the
decoupled key-equality pattern, where leakage of
key-equality between update and query operations
is decoupled and the adversary only learns whether two operations of the
same type are performed on the same key or not. We show that
any EMM with at most decoupled key-equality pattern
leakage incurs overhead in the
leakage cell probe model.
This is tight as there exist ORAM-based constructions of EMMs with logarithmic slowdown that leak no more than the decoupled key-equality pattern (and actually, much less).
Furthermore, we present stronger lower bounds showing that
encrypted multi-maps that leak at most the decoupled key-equality pattern
but can perform either the update or the query operations
in the plaintext still require overhead.
Finally, we extend our lower bounds to show that
dynamic, searchable encryption schemes
must also incur overhead, even when either
the document updates or the searches may be performed in the plaintext.